Abstract

We propose a new one-sample test for normality in a Reproducing Kernel Hilbert Space 
(RKHS). Namely, we test the null-hypothesis of belonging to a given family of Gaussian 
distributions. Hence our procedure may be applied either to test data for normality or 
to test parameters (mean and covariance) if data are assumed Gaussian. Our test is 
based on the same principle as the MMD (Maximum Mean Discrepancy) which is usually 
used for two-sample tests such as homogeneity or independence testing. Our method 
makes use of a special kind of parametric bootstrap (typical of goodness-of-fit tests) 
which is computationally more efficient than standard parametric bootstrap. Moreover, 
an upper bound for the Type-II error highlights the dependence on influential quantities.
Experiments illustrate the practical improvement allowed by our test in high-dimensional 
settings where common normality tests are known to fail. We also consider an application 
to covariance rank selection through a sequential procedure. 
1 Introduction

Non-vectorial data such as DNA sequences or pictures often require a positive semi-definite 
kernel [1] which plays the role of a similarity function. For instance, two strings can be compared 
by counting the number of common substrings. Further analysis is then carried out in the 
associated reproducing kernel Hilbert space (RKHS), that is the Hilbert space spanned by the 
evaluation functions $k(x, \cdot)$ for every $x$ in the input space. Thus embedding data into this RKHS
through the feature map $x \mapsto k(x, \cdot)$ allows one to apply linear algorithms to initially non-vectorial
inputs.

Embedded data are often assumed to have a Gaussian distribution. For instance supervised 
and unsupervised classification are performed in [6] by modeling each class as a Gaussian 
process. In [32], outliers are detected by modelling embedded data as a Gaussian random 
variable and by removing points lying in the tails of that Gaussian distribution. This key 
assumption is also made in [37] where a mean equality test is used in a high-dimensional setting.
Moreover, Principal Component Analysis (PCA) and its kernelized version Kernel PCA [34] 
are known to be optimal for Gaussian data as these methods rely on second-order statistics 
(covariance). Besides, a Gaussian assumption allows the use of Expectation-Maximization (EM)
techniques to speed up PCA [33].

Depending on the (finite or infinite dimensional) structure of the RKHS, Cramér-von-Mises-type
normality tests can be applied, such as Mardia's skewness test [29], the Henze-Zirkler test
[22] and the Energy-distance test [40]. However these tests become less powerful as dimension
increases (see Table 3 in [40]). An alternative approach consists in randomly projecting
high-dimensional objects on one-dimensional directions and then applying a univariate test on a few
randomly chosen marginals [24]. This projection pursuit method has the advantage of being
suited to high-dimensional settings. On the other hand, such approaches also suffer from a lack of
power because of the limited number of considered directions (see Section 4.2 in [24]).

In the RKHS setting, [17] introduced the Maximum Mean Discrepancy (MMD) which quantifies
the gap between two distributions through distances between two elements of an RKHS.
The MMD approach has been used for two-sample testing [17] and for independence testing
(Hilbert Space Independence Criterion, [20]). However to the best of our knowledge, MMD has
not been applied in a one-sample goodness-of-fit testing framework.

The main contribution of the present paper is to provide a one-sample statistical test of 
normality for data in a general Hilbert space (which can be an RKHS), by means of the MMD 
principle. This test features two possible applications: testing the normality of the data but also 
testing parameters (mean and covariance) if data are assumed Gaussian. The latter application 
encompasses many current methods that assume normality to make inferences on parameters, 
for instance to test the nullity of the mean [37] or to assess the sparse structure of the covariance 
[39, 2]. 

Once the test statistic is defined, a critical value is needed to decide whether to accept
or reject the Gaussian hypothesis. In goodness-of-fit testing, this critical value is typically
estimated by parametric bootstrap. Unfortunately, parametric bootstrap requires parameters to
be computed several times, hence heavy computational costs (i.e. diagonalization of covariance
matrices). Our test bypasses the recomputation of parameters by implementing a faster version
of parametric bootstrap. Following the idea of [26], this fast bootstrap method "linearizes" the
test statistic through a Fréchet derivative approximation and thus can estimate the critical
value by a weighted bootstrap (in the sense of [8]) which is computationally more efficient.
Furthermore our version of this bootstrap method allows parameter estimators that are not
explicitly "linear" (i.e. that consist of a sum of independent terms) and that take values in
possibly infinite-dimensional Hilbert spaces.

Finally, we illustrate our test and present a sequential procedure that assesses the rank of a 
covariance operator. The problem of covariance rank estimation is addressed in several domains:
functional regression [9, 7], classification [41] and dimension reduction methods such as PCA,
Kernel PCA and Non-Gaussian Component Analysis [3, 12, 13] where the dimension of the
kept subspace is a crucial problem.
Here is the outline of the paper. Section 2 sets our framework and Section 3 introduces
the MMD and how it is used for our one-sample test. The new normality test is described in 
Section 4, while both its theoretical and empirical performances are detailed in Section 5 in 
terms of control of Type-I and Type-II errors. A sequential procedure to select covariance rank 
is presented in Section 6. 


2 Framework 

Let $(\mathcal{H}, \mathcal{A})$ be a measurable space, and $Y_1, \ldots, Y_n \in \mathcal{H}$ denote a sample of independent and
identically distributed (i.i.d.) random variables drawn from an unknown distribution $P \in \mathcal{P}$,
where $\mathcal{P}$ is a set of distributions defined on $\mathcal{A}$.

In our framework, $\mathcal{H}$ is a separable Hilbert space endowed with a dot product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$
and the associated norm $\|\cdot\|_{\mathcal{H}}$ (defined by $\|h\|_{\mathcal{H}} = \langle h, h \rangle_{\mathcal{H}}^{1/2}$ for any $h \in \mathcal{H}$). Our goal is to test
whether $Y_1$ is a Gaussian random variable (r.v.) of $\mathcal{H}$, which is defined as follows.

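Definition 2.1. (Gaussian random variable in a Hilbert space)

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a measure space, $(\mathcal{H}, \mathcal{F}')$ a measurable space where $\mathcal{H}$ is a Hilbert space, and
$Y : \Omega \to \mathcal{H}$ a measurable map.

$Y$ is a Gaussian r.v. of $\mathcal{H}$ if $\langle Y, h \rangle_{\mathcal{H}}$ is a univariate Gaussian r.v. for any $h \in \mathcal{H}$.

Assuming that $\mathbb{E}\|Y\|_{\mathcal{H}} < +\infty$, there exists $m \in \mathcal{H}$ such that:

$$\forall h \in \mathcal{H}, \quad \langle m, h \rangle_{\mathcal{H}} = \mathbb{E}\langle Y, h \rangle_{\mathcal{H}},$$

and a (finite trace) operator $\Sigma : \mathcal{H} \to \mathcal{H}$ satisfying:

$$\forall h, h' \in \mathcal{H}, \quad \langle \Sigma h, h' \rangle_{\mathcal{H}} = \mathrm{cov}(\langle Y, h \rangle_{\mathcal{H}}, \langle Y, h' \rangle_{\mathcal{H}}).$$

$m$ and $\Sigma$ are respectively the mean and the covariance operator of $Y$. The distribution of $Y$ is
denoted $\mathcal{N}(m, \Sigma)$.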

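More precisely, the tested hypothesis is that $Y_1$ follows a Gaussian distribution $\mathcal{N}(m_0, \Sigma_0)$,
where $(m_0, \Sigma_0) \in \Theta_0$ and $\Theta_0$ is a subset of the parameter space $\Theta$.¹ Following [28], let us
define the null hypothesis $H_0 : P \in \mathcal{P}_0$, and the alternative hypothesis $H_1 : P \in \mathcal{P} \setminus \mathcal{P}_0$ where
the subset of null-hypotheses $\mathcal{P}_0 \subseteq \mathcal{P}$ is

$$\mathcal{P}_0 = \{\mathcal{N}(m_0, \Sigma_0) \mid (m_0, \Sigma_0) \in \Theta_0\}.$$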

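¹ The parameter space $\Theta$ is endowed with the dot product $\langle (m, \Sigma), (m', \Sigma') \rangle_{\Theta} = \langle m, m' \rangle_{\mathcal{H}} + \langle \Sigma, \Sigma' \rangle_{HS(\mathcal{H})}$,
where $HS(\mathcal{H})$ is the space of Hilbert-Schmidt (finite trace) operators $\mathcal{H} \to \mathcal{H}$ and $\langle \Sigma, \Sigma' \rangle_{HS(\mathcal{H})} = \sum_{i \geq 1} \langle \Sigma e_i, \Sigma' e_i \rangle_{\mathcal{H}}$
for any complete orthonormal basis $(e_i)_{i \geq 1}$ of $\mathcal{H}$. Therefore, for any $\theta \in \Theta$, the tensor
product $\theta^{\otimes 2}$ is defined as the operator $\Theta \to \Theta$, $\theta' \mapsto \langle \theta, \theta' \rangle_{\Theta}\, \theta$. For any $\theta \in \Theta$ and $h \in H(k)$, the tensor product
$h \otimes \theta$ is the operator $\Theta \to H(k)$, $\theta' \mapsto \langle \theta, \theta' \rangle_{\Theta}\, h$.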
The purpose of a statistical test $T(Y_1, \ldots, Y_n)$ of $H_0$ against $H_1$ is to distinguish between the
null ($H_0$) and the alternative ($H_1$) hypotheses. It requires two elements: a statistic $n\hat{\Delta}^2$ (which
we define in Section 4.1) that measures the gap between the empirical distribution of the data
and the considered family of normal distributions $\mathcal{P}_0$, and a rejection region $\mathcal{R}_\alpha$ (at a level
of confidence $0 < \alpha < 1$). $H_0$ is accepted if and only if $n\hat{\Delta}^2 \notin \mathcal{R}_\alpha$. The rejection region is
determined by the distribution of $n\hat{\Delta}^2$ under the null-hypothesis such that the probability of
wrongly rejecting $H_0$ (Type-I error) is controlled by $\alpha$.

3 The Maximum Mean Discrepancy (MMD) 

Following [17] the gap between two distributions $P$ and $Q$ on $\mathcal{H}$ is measured by

$$\Delta(P, Q) = \sup_{f \in \mathcal{F}} |\mathbb{E}_{Y \sim P} f(Y) - \mathbb{E}_{Z \sim Q} f(Z)|, \qquad (3.1)$$

where $\mathcal{F}$ is a class of real valued functions on $\mathcal{H}$. Regardless of $\mathcal{F}$, (3.1) always defines a
pseudo-metric² on probability distributions [36].

The choice of $\mathcal{F}$ is subject to two requirements: (i) (3.1) must define a metric between
distributions, that is

$$\forall P, Q, \quad \Delta(P, Q) = 0 \Rightarrow P = Q, \qquad (3.2)$$

and (ii) (3.1) must be expressed in an easy-to-compute form (without the supremum term).

To solve those two issues, several papers [18, 19, 20] have considered the case when $\mathcal{F}$ is
the unit ball of a reproducing kernel Hilbert space (RKHS) $H(k)$ associated with a positive
semi-definite kernel $k : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$.

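Definition 3.1. (Reproducing Kernel Hilbert space, [1]) Let $k$ be a positive semi-definite kernel,
i.e.

$$\forall x_1, \ldots, x_n \in \mathcal{H}, \ \forall \alpha_1, \ldots, \alpha_n, \quad \sum_{i,j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) \geq 0,$$

with equality if and only if $\alpha_1 = \ldots = \alpha_n = 0$.

There exists a unique Hilbert space $H(k)$ of real-valued functions on $\mathcal{H}$ which satisfies:

• $\forall x \in \mathcal{H}$, $k(x, \cdot) \in H(k)$,

• $\forall f \in H(k)$, $\forall x \in \mathcal{H}$, $\langle f, k(x, \cdot) \rangle_{H(k)} = f(x)$.

$H(k)$ is the reproducing kernel Hilbert space (RKHS) of $k$.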

² A pseudo-metric $\Delta(\cdot, \cdot)$ satisfies for any $P, Q, R$: (i) $\Delta(P, P) = 0$, (ii) $\Delta(P, Q) = \Delta(Q, P)$, and (iii)
$\Delta(P, R) \leq \Delta(P, Q) + \Delta(Q, R)$.
Let $\|\cdot\|_{H(k)} = \langle \cdot, \cdot \rangle_{H(k)}^{1/2}$ be the norm of $H(k)$ and $B_1(k) = \{f \in H(k) \mid \|f\|_{H(k)} \leq 1\}$ denote
the unit ball of $H(k)$. When $\mathcal{F} = B_1(k)$, $\Delta(\cdot, \cdot)$ becomes a metric only for a class of kernels $k$
that are called characteristic.

Definition 3.2. (Characteristic kernel, [15])

Let $\mathcal{F} = B_1(k)$ in (3.1) for some kernel $k$. Then $k$ is a characteristic kernel if $\Delta(P, Q) = 0$
implies $P = Q$.

Most common kernels are characteristic: Gaussian kernels $k(x, y) = \exp(-\sigma \|x - y\|_{\mathcal{H}}^2)$
where $\sigma > 0$, the exponential kernel $k(x, y) = \exp(\langle x, y \rangle_{\mathcal{H}})$ and Student kernels $k(x, y) =
(1 + a\|x - y\|_{\mathcal{H}}^2)^{-\alpha}$ where $a, \alpha > 0$, to name a few. Several criteria for a kernel to be characteristic
exist (see [36, 16, 11]).

Moreover taking $\mathcal{F} = B_1(k)$ makes it possible to cast $\Delta(P, Q)$ as an easy-to-compute quantity. This
is done by embedding any distribution $P$ in the RKHS $H(k)$ as follows.

Definition 3.3. (Hilbert space embedding, Lemma 3 from [18]) Let $P$ be a distribution such
that $\mathbb{E}_{Y \sim P}\, k^{1/2}(Y, Y) < +\infty$.

Then there exists $\bar{\mu}_P \in H(k)$ such that for every $f \in H(k)$,

$$\langle \bar{\mu}_P, f \rangle_{H(k)} = \mathbb{E} f(Y).$$

$\bar{\mu}_P$ is called the Hilbert space embedding of $P$ in $H(k)$.

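Thus $\Delta(P, Q)$ can be expressed as the gap between the Hilbert space embeddings of $P$ and
$Q$ (Lemma 4 in [18]):

$$\Delta(P, Q) = \sup_{f \in H(k),\, \|f\|_{H(k)} \leq 1} |\mathbb{E}_P f(Y) - \mathbb{E}_Q f(Z)| = \sup_{f \in H(k),\, \|f\|_{H(k)} \leq 1} |\langle \bar{\mu}_P - \bar{\mu}_Q, f \rangle_{H(k)}| = \|\bar{\mu}_P - \bar{\mu}_Q\|_{H(k)}. \qquad (3.3)$$

(3.3) is called the Maximum Mean Discrepancy (MMD) between $P$ and $Q$.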
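
To make (3.3) concrete, the following is a minimal sketch of a plug-in estimate of the squared MMD between two finite-dimensional samples, obtained by replacing each embedding with an empirical mean of kernel functions. The Gaussian kernel, the function names `gaussian_gram` and `mmd_squared`, and the choice of the biased (V-statistic) form are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Matrix [exp(-sigma * ||x_i - y_j||^2)] for the rows of X and Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sigma * np.maximum(sq, 0.0))

def mmd_squared(X, Y, sigma=1.0):
    """Biased plug-in estimate of ||mu_P - mu_Q||^2 in H(k).

    Expanding the squared norm gives three averages of kernel values:
    within X, within Y, and across X and Y.
    """
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=(60, 3)) + 2.0   # shifted sample, so P and Q differ
gap = mmd_squared(X, Y)
```

The estimate is zero when a sample is compared with itself and grows as the two samples move apart, which is the behaviour a test statistic built on (3.3) relies on.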

Within our framework the goal is to compare $P$, the true distribution of the data, with a
Gaussian distribution $P_0 = \mathcal{N}(m_0, \Sigma_0)$ for some $(m_0, \Sigma_0) \in \Theta_0$. Hence the quantity of interest
is

$$\Delta^2 = \|\bar{\mu}_P - \bar{\mu}_{P_0}\|_{H(k)}^2. \qquad (3.4)$$

For the sake of simplicity, we use the notation

$$\bar{\mu}_{\mathcal{N}(m, \Sigma)} = N[m, \Sigma]$$

to denote the Hilbert space embedding of a Gaussian distribution.
Algorithm 1 Kernel Normality Test procedure

Input: $Y_1, \ldots, Y_n \in \mathcal{H}$, $k : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ (kernel) and $0 < \alpha < 1$ (test level).

1. Compute $K = [\langle Y_i, Y_j \rangle_{\mathcal{H}}]_{i,j}$ (Gram matrix).

2. Compute $n\hat{\Delta}^2$ (test statistic) from (4.5) that depends on $K$ and $k$ (Section 4.1).

3. (a) Draw $B$ (approximate) independent copies of $n\hat{\Delta}^2$ under $H_0$ by fast parametric
bootstrap (Section 4.2).

(b) Compute $\hat{q}_{\alpha,n}$ ($1 - \alpha$ quantile of $n\hat{\Delta}^2$ under $H_0$) from these replications.

Output: Reject $H_0$ if $n\hat{\Delta}^2 > \hat{q}_{\alpha,n}$, and accept otherwise.


4 Kernel normality test 

This section introduces our one-sample test for normality based on the quantity (3.4). As said
in Section 2, we test the null-hypothesis $H_0 : P \in \{\mathcal{N}(m_0, \Sigma_0) \mid (m_0, \Sigma_0) \in \Theta_0\}$ where $\Theta_0$ is
a subset of the parameter space. Therefore our procedure may be used as a test for normality
or as a test on parameters if data are assumed Gaussian. The test procedure is summed up in
Algorithm 1.

4.1 Test statistic 

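As in [17], $\Delta^2$ can be estimated by replacing $\bar{\mu}_P$ with the sample mean

$$\hat{\bar{\mu}}_P = \bar{\mu}_{\hat{P}} = (1/n) \sum_{i=1}^{n} k(Y_i, \cdot),$$

where $\hat{P} = (1/n) \sum_{i=1}^{n} \delta_{Y_i}$ is the empirical distribution. The null-distribution embedding
$N[m_0, \Sigma_0]$ is estimated by $N[\tilde{m}, \tilde{\Sigma}]$ where $\tilde{m}$ and $\tilde{\Sigma}$ are appropriate and consistent (under
$H_0$) estimators of $m_0$ and $\Sigma_0$. This yields the estimator

$$\hat{\Delta}^2 = \|\hat{\bar{\mu}}_P - N[\tilde{m}, \tilde{\Sigma}]\|_{H(k)}^2,$$

which can be made explicit by expanding the square of the norm and using the reproducing property
of $H(k)$ as follows

$$\hat{\Delta}^2 = \frac{1}{n^2} \sum_{i,j=1}^{n} k(Y_i, Y_j) - \frac{2}{n} \sum_{i=1}^{n} N[\tilde{m}, \tilde{\Sigma}](Y_i) + \|N[\tilde{m}, \tilde{\Sigma}]\|_{H(k)}^2. \qquad (4.5)$$

Proposition 4.1 ensures the consistency of the statistic (4.5).

Proposition 4.1. Assume that $P$ is Gaussian $\mathcal{N}(m_0, \Sigma_0)$ where $(m_0, \Sigma_0) \in \Theta_0$ and $(\tilde{m}, \tilde{\Sigma})$ are
consistent estimators of $(m_0, \Sigma_0)$. Also assume that $\mathbb{E}_P k(Y, Y) < +\infty$ and $N[m, \Sigma]$ is a
continuous function of $(m, \Sigma)$ on $\Theta_0$. Then $\hat{\Delta}^2$ is a consistent estimator of $\Delta^2$.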
Proof. First, note that $\bar{\mu}_P$ exists since $\mathbb{E} k(Y, Y) < +\infty$ implies $\mathbb{E} k^{1/2}(Y, Y) < +\infty$. By the Law
of Large Numbers in Hilbert Spaces [23], $\hat{\bar{\mu}}_P \to \bar{\mu}_P$ $P$-almost surely as $n \to \infty$ since
$\mathbb{E}\|k(Y, \cdot) - \bar{\mu}_P\|_{H(k)}^2 = \mathbb{E} k(Y, Y) - \mathbb{E} k(Y, Y') \leq \mathbb{E} k(Y, Y) + \mathbb{E}^{1/2} k^2(Y, Y') < +\infty$.
The continuity of $N[m, \Sigma]$ (with respect to $(m, \Sigma)$) and the consistency of $(\tilde{m}, \tilde{\Sigma})$ yield
$N[\tilde{m}, \tilde{\Sigma}] \to N[m_0, \Sigma_0]$ $P$-a.s. as $n \to \infty$. Finally, the continuity of $\|\cdot\|_{H(k)}^2$ leads to
$\hat{\Delta}^2 \to \Delta^2$ as $n \to \infty$. □

The expressions for $N[\tilde{m}, \tilde{\Sigma}](Y_i)$ and $\|N[\tilde{m}, \tilde{\Sigma}]\|_{H(k)}^2$ in (4.5) depend on the choice of $k$. Those
are given by Propositions 4.2 and 4.3 when $k$ is Gaussian and exponential. Note that in these
cases, the continuity assumption of $N[m, \Sigma]$ required by Proposition 4.1 is satisfied.

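Before stating Propositions 4.2 and 4.3, the following notation is introduced. For a symmetric
operator $L : \mathcal{H} \to \mathcal{H}$ with eigenexpansion $L = \sum_{r \geq 1} \lambda_r \Psi_r^{\otimes 2}$, its determinant is denoted
$|L| = \prod_{r \geq 1} \lambda_r$. For any $q \in \mathbb{R}$, the operator $L^q$ is defined as $L^q = \sum_{r \geq 1} \lambda_r^q \Psi_r^{\otimes 2}$.

Proposition 4.2. (Gaussian kernel case) Let $k(\cdot, \cdot) = \exp(-\sigma \|\cdot - \cdot\|_{\mathcal{H}}^2)$ where $\sigma > 0$. Then,

$$N[\tilde{m}, \tilde{\Sigma}](\cdot) = |I + 2\sigma\tilde{\Sigma}|^{-1/2} \exp\left(-\sigma \|(I + 2\sigma\tilde{\Sigma})^{-1/2}(\cdot - \tilde{m})\|_{\mathcal{H}}^2\right),$$
$$\|N[\tilde{m}, \tilde{\Sigma}]\|_{H(k)}^2 = |I + 4\sigma\tilde{\Sigma}|^{-1/2}.$$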

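Proposition 4.3. (Exponential kernel case) Let $k(\cdot, \cdot) = \exp(\langle \cdot, \cdot \rangle_{\mathcal{H}})$. Assume that the largest
eigenvalue of $\tilde{\Sigma}$ is smaller than 1. Then,

$$N[\tilde{m}, \tilde{\Sigma}](\cdot) = \exp\left(\langle \tilde{m}, \cdot \rangle_{\mathcal{H}} + (1/2) \langle \tilde{\Sigma}\, \cdot, \cdot \rangle_{\mathcal{H}}\right),$$
$$\|N[\tilde{m}, \tilde{\Sigma}]\|_{H(k)}^2 = |I - \tilde{\Sigma}^2|^{-1/2} \exp\left(\|(I - \tilde{\Sigma})^{-1/2} \tilde{m}\|_{\mathcal{H}}^2\right).$$

The proofs of Propositions 4.2 and 4.3 are provided in Appendix B.1.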
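
As an illustration of how the Gaussian-kernel formulas of Proposition 4.2 feed into the statistic (4.5), here is a minimal finite-dimensional sketch (data in $\mathbb{R}^d$ instead of a general Hilbert space). The function names, the use of the empirical mean and covariance as plug-in estimators, and the dense $O(n^2 d)$ computation are assumptions made for this example only.

```python
import numpy as np

def gaussian_embedding(Y, m, S, sigma):
    """N[m, S](y) for each row y of Y, with kernel exp(-sigma * ||x - y||^2)."""
    d = m.shape[0]
    A = np.eye(d) + 2.0 * sigma * S
    diff = Y - m
    # Quadratic form (y - m)^T A^{-1} (y - m), one value per row.
    quad = np.einsum("ij,ij->i", diff, np.linalg.solve(A, diff.T).T)
    return np.linalg.det(A) ** -0.5 * np.exp(-sigma * quad)

def embedding_sq_norm(S, sigma):
    """Squared RKHS norm of N[m, S]; it does not depend on m."""
    d = S.shape[0]
    return np.linalg.det(np.eye(d) + 4.0 * sigma * S) ** -0.5

def knt_statistic(Y, sigma=1.0):
    """n * Delta_hat^2 from (4.5), empirical mean and covariance plugged in."""
    n = Y.shape[0]
    m = Y.mean(axis=0)
    S = np.atleast_2d(np.cov(Y, rowvar=False, bias=True))  # 1/n normalisation
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sigma * sq)
    delta2 = (K.mean()
              - 2.0 * gaussian_embedding(Y, m, S, sigma).mean()
              + embedding_sq_norm(S, sigma))
    return n * delta2
```

Since the three terms are the expansion of a squared norm, the returned value is non-negative up to rounding error.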

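For most estimators $(\tilde{m}, \tilde{\Sigma})$, the quantities provided in Propositions 4.2 and 4.3 are
computable via the Gram matrix $K = [\langle Y_i, Y_j \rangle_{\mathcal{H}}]_{1 \leq i,j \leq n}$. For instance, assume that $(\tilde{m}, \tilde{\Sigma})$ are the
classical estimators $(\hat{m}, \hat{\Sigma})$ where $\hat{m} = (1/n) \sum_{i=1}^{n} Y_i$ and $\hat{\Sigma} = (1/n) \sum_{i=1}^{n} (Y_i - \hat{m})^{\otimes 2}$. Let $I_n$
and $J_n$ be respectively the $n \times n$ identity matrix and the $n \times n$ matrix whose entries all equal
1, $H = I_n - (1/n) J_n$ and $K_c = HKH$ be the centered Gram matrix. Then for any $c \in \mathbb{R}$,

$$|I + c\hat{\Sigma}| = \det\left(I_n + \frac{c}{n} K_c\right),$$

where $\det(\cdot)$ denotes the determinant of a matrix and

$$\|(I + c\hat{\Sigma})^{-1/2}(Y_i - \hat{m})\|_{\mathcal{H}}^2 = \left[\left(I_n + \frac{c}{n} K_c\right)^{-1} K_c\right]_{i,i},$$

where $[\cdot]_{i,j}$ denotes the entry in the $i$-th row and the $j$-th column of a matrix.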
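
The Gram-matrix identities can be checked numerically in finite dimension. The sketch below compares the "primal" quantities (computed from the $d \times d$ empirical covariance) with their "dual" counterparts (computed from the $n \times n$ centered Gram matrix). The scalar name `c`, the sample sizes, and the specific form of the quadratic-form identity (derived here from the push-through identity $A(I + BA)^{-1} = (I + AB)^{-1}A$) are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 30, 4, 1.7             # c stands for the arbitrary real scalar
Y = rng.normal(size=(n, d))

K = Y @ Y.T                      # Gram matrix of inner products <Y_i, Y_j>
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H                   # centered Gram matrix

Yc = Y - Y.mean(axis=0)
S = Yc.T @ Yc / n                # empirical covariance, 1/n normalisation

# Determinant: d x d computation versus n x n computation.
det_primal = np.linalg.det(np.eye(d) + c * S)
det_dual = np.linalg.det(np.eye(n) + (c / n) * Kc)

# Quadratic forms (Y_i - m)^T (I + c S)^{-1} (Y_i - m), one per observation.
quad_primal = np.einsum("ij,ij->i", Yc,
                        np.linalg.solve(np.eye(d) + c * S, Yc.T).T)
quad_dual = np.diag(np.linalg.solve(np.eye(n) + (c / n) * Kc, Kc))
```

Only the dual side remains computable when the observations live in an infinite-dimensional space, which is why the statistic is expressed through $K$.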










4.2 Estimation of the critical value 

Designing a test with confidence level $0 < \alpha < 1$ requires computing the $1 - \alpha$ quantile of
the $n\hat{\Delta}^2$ distribution under $H_0$, denoted by $q_{\alpha,n}$. Thus $q_{\alpha,n}$ serves as a critical value to decide
whether the test statistic is significantly larger than 0 or not, so that the probability of wrongly
rejecting $H_0$ (Type-I error) is at most $\alpha$.

4.2.1 Classical parametric bootstrap 

In the case of a goodness-of-fit test, a usual way of estimating $q_{\alpha,n}$ is to perform a parametric
bootstrap. Parametric bootstrap consists in generating $B$ samples of $n$ i.i.d. random variables
$Y_1^{(b)}, \ldots, Y_n^{(b)} \sim \mathcal{N}(\tilde{m}, \tilde{\Sigma})$ ($b = 1, \ldots, B$). Each of these $B$ samples is used to compute a
bootstrap replication

$$[n\hat{\Delta}^2]^b = n \|\hat{\bar{\mu}}_P^b - N[\tilde{m}^b, \tilde{\Sigma}^b]\|_{H(k)}^2, \qquad (4.6)$$

where $\hat{\bar{\mu}}_P^b$, $\tilde{m}^b$ and $\tilde{\Sigma}^b$ are the estimators of $\bar{\mu}_P$, $m$ and $\Sigma$ based on $Y_1^{(b)}, \ldots, Y_n^{(b)}$.

It is known that parametric bootstrap is asymptotically valid [38]. Namely, under $H_0$,

$$\forall b = 1, \ldots, B, \quad \left(n\hat{\Delta}^2, [n\hat{\Delta}^2]^b\right) \xrightarrow[n \to +\infty]{\mathcal{L}} (U, U'),$$

where $U$ and $U'$ are i.i.d. random variables. In a nutshell, (4.6) is approximately an independent
copy of the test statistic $n\hat{\Delta}^2$ (under $H_0$). Therefore $B$ replications $[n\hat{\Delta}^2]^b$ can be used
to estimate the $1 - \alpha$ quantile $q_{\alpha,n}$ of $n\hat{\Delta}^2$ under the null-hypothesis.

However, this approach suffers heavy computational costs. In particular, each bootstrap
replication involves the estimators $(\tilde{m}^b, \tilde{\Sigma}^b)$. In our case, this leads to computing the
eigendecomposition of the $B$ Gram matrices of size $n \times n$, hence a complexity of order
$O(Bn^3)$.

4.2.2 Fast parametric bootstrap 

This computational limitation is alleviated by means of another strategy described in [26]. Let
us first consider the case when the estimators of $m$ and $\Sigma$ are the classical empirical
mean and covariance $\hat{m} = (1/n) \sum_{i=1}^{n} Y_i$ and $\hat{\Sigma} = (1/n) \sum_{i=1}^{n} (Y_i - \hat{m})^{\otimes 2}$. Introducing the
Fréchet derivative [14] $D_{(m,\Sigma)}N$ at $(m, \Sigma)$ of the function

$$N : \Theta \to H(k), \quad (m, \Sigma) \mapsto N[m, \Sigma],$$
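our bootstrap method relies on the following approximation

$$\sqrt{n}\left(\hat{\bar{\mu}}_P - N[\hat{m}, \hat{\Sigma}]\right) \simeq \sqrt{n}\Big(\hat{\bar{\mu}}_P - \underbrace{N[m_0, \Sigma_0]}_{= \bar{\mu}_P \text{ under } H_0} - D_{(m_0,\Sigma_0)}N[\hat{m} - m_0, \hat{\Sigma} - \Sigma_0]\Big)$$
$$\simeq \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{k(Y_i, \cdot) - \bar{\mu}_P - D_{(m_0,\Sigma_0)}N[Y_i - m_0, (Y_i - m_0)^{\otimes 2} - \Sigma_0]\right\}. \qquad (4.7)$$

Since (4.7) consists of a sum of centered independent terms (under $H_0$), it is possible to generate
approximate independent copies of this sum via weighted bootstrap [8]. Given $Z_1^b, \ldots, Z_n^b$ i.i.d.
real random variables of mean zero and unit variance and $\bar{Z}^b$ their empirical mean, a bootstrap
replication of (4.7) is given by

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} (Z_i^b - \bar{Z}^b)\left\{k(Y_i, \cdot) - D_{(m_0,\Sigma_0)}N[Y_i, (Y_i - m_0)^{\otimes 2}]\right\}. \qquad (4.8)$$

Taking the square of the norm of (4.8) in $H(k)$ and replacing the unknown true parameters $m_0$
and $\Sigma_0$ by their estimators $\hat{m}$ and $\hat{\Sigma}$ yields the bootstrap replication of $n\hat{\Delta}^2$

$$[n\hat{\Delta}^2]^b_{fast} = \left\|\sqrt{n}\left(\hat{\bar{\mu}}_P^b - D_{(\hat{m},\hat{\Sigma})}N[\hat{m}^b, \hat{\Sigma}^b]\right)\right\|_{H(k)}^2, \qquad (4.9)$$

where

$$\hat{\bar{\mu}}_P^b = (1/n) \sum_{i=1}^{n} (Z_i^b - \bar{Z}^b)\, k(Y_i, \cdot),$$
$$\hat{m}^b = (1/n) \sum_{i=1}^{n} (Z_i^b - \bar{Z}^b)\, Y_i,$$
$$\hat{\Sigma}^b = (1/n) \sum_{i=1}^{n} (Z_i^b - \bar{Z}^b)(Y_i - \hat{m})^{\otimes 2}.$$

Therefore this approach avoids the recomputation of parameters for each bootstrap replication,
hence a computational cost of order $O(Bn^2)$ instead of $O(Bn^3)$. This is illustrated
empirically in the right half of Figure 1.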
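
The weighted-bootstrap idea can be sketched in the simplest special case, namely when the null parameters are fully known, so that the Fréchet-derivative correction vanishes and a replication reduces to the squared norm of $n^{-1/2} \sum_i (Z_i^b - \bar{Z}^b)\, k(Y_i, \cdot)$. This restriction, the Gaussian weights, and the function name `multiplier_replications` are assumptions of this example; the general case additionally requires the derivative term of (4.9).

```python
import numpy as np

def multiplier_replications(K, B, rng):
    """B weighted-bootstrap replications of
    || n^{-1/2} * sum_i (Z_i - Zbar) k(Y_i, .) ||^2  in H(k).

    Assumption: the null parameters are known, so the derivative
    correction of the general replication is dropped.
    K is the n x n matrix [k(Y_i, Y_j)].
    """
    n = K.shape[0]
    Z = rng.standard_normal((B, n))          # mean-zero, unit-variance weights
    W = Z - Z.mean(axis=1, keepdims=True)    # centered weights Z_i - Zbar
    # By the reproducing property the squared norm is w^T K w / n.
    return np.einsum("bi,ij,bj->b", W, K, W) / n
```

Each replication costs one quadratic form in an $n \times n$ matrix and no eigendecomposition, which is where the gain over the classical parametric bootstrap comes from.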

4.2.3 Fast parametric bootstrap for general parameter estimators 

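The bootstrap method proposed by [26] used in Section 4.2.2 requires that the estimators
$(\tilde{m}, \tilde{\Sigma})$ can be written as a sum of independent terms with an additive term which converges
to 0 in probability. Formally, $(\tilde{m}, \tilde{\Sigma}) = (m_0, \Sigma_0) + (1/n) \sum_{i=1}^{n} \psi(Y_i) + \varepsilon$ where $\mathbb{E}\psi(Y) = 0$,
$\mathrm{Var}(\psi(Y)) < +\infty$ and $\varepsilon \to 0$ in probability as $n \to +\infty$. However there are some estimators
which cannot be written in this form straightforwardly. This is the case for instance if we test
whether data follow a Gaussian with covariance of fixed rank $r$ (as in Section 6). In this example,
the associated estimators are $\tilde{m} = \hat{m} = (1/n) \sum_{i=1}^{n} Y_i$ (empirical mean) and
$\tilde{\Sigma} = \hat{\Sigma}_r = \sum_{s=1}^{r} \hat{\lambda}_s \hat{\Psi}_s^{\otimes 2}$ where
$(\hat{\lambda}_s)_s$ and $(\hat{\Psi}_s)_s$ are the eigenvalues and eigenvectors of the empirical covariance operator
$\hat{\Sigma} = (1/n) \sum_{i=1}^{n} (Y_i - \hat{m})^{\otimes 2}$.

We extend (4.9) to the general case when $\Theta_0 \neq \Theta$ and the estimators $(\tilde{m}, \tilde{\Sigma})$ are not the
classical $(\hat{m}, \hat{\Sigma})$. We assume that the estimators $(\tilde{m}, \tilde{\Sigma})$ are functions of the empirical estimators
$\hat{m}$ and $\hat{\Sigma}$, namely there exists a continuous mapping $T$ such that

$$(\tilde{m}, \tilde{\Sigma}) = T(\hat{m}, \hat{\Sigma}), \quad \text{where } T(\Theta) \subseteq \Theta_0 \text{ and } T|_{\Theta_0} = \mathrm{Id}_{\Theta_0}.$$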


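Under this definition, $(\tilde{m}, \tilde{\Sigma})$ are consistent estimators of $(m, \Sigma)$ when $(m, \Sigma) \in \Theta_0$. This kind
of estimator arises for various choices of the null-hypothesis:

• Unknown mean and covariance: $(\tilde{m}, \tilde{\Sigma}) = (\hat{m}, \hat{\Sigma})$ and $T$ is the identity map $\mathrm{Id}_{\Theta}$,

• Known mean and covariance: $(\tilde{m}, \tilde{\Sigma}) = (m_0, \Sigma_0)$ and $T$ is the constant map
$T(m, \Sigma) = (m_0, \Sigma_0)$,

• Known mean and unknown covariance: $(\tilde{m}, \tilde{\Sigma}) = (m_0, \hat{\Sigma})$ and $T(m, \Sigma) = (m_0, \Sigma)$,

• Unknown mean and covariance of known rank $r$: $(\tilde{m}, \tilde{\Sigma}) = (\hat{m}, \hat{\Sigma}_r)$ and $T(m, \Sigma) =
(m, \Sigma_r)$ where $\Sigma_r$ is the rank $r$ truncation of $\Sigma$.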
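
For the last case in the list, the covariance part of $T$ is a rank-$r$ truncation. A minimal finite-dimensional sketch is given below; the function name `rank_truncation` and the use of a dense symmetric eigendecomposition are assumptions of this example.

```python
import numpy as np

def rank_truncation(S, r):
    """Keep the r leading eigenpairs of a symmetric PSD matrix S."""
    vals, vecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:r]      # indices of the r largest ones
    return (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T
```

Applied to a matrix that already has rank $r$, the truncation returns it unchanged (up to rounding), which matches the requirement that $T$ restricted to $\Theta_0$ is the identity.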

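By introducing $T$, we get a similar approximation as in (4.7) by replacing the mapping $N :
\Theta_0 \to H(k)$ with $N \circ T : \Theta \to H(k)$. This leads to the bootstrap replication

$$[n\hat{\Delta}^2]^b_{fast} = \left\|\sqrt{n}\left(\hat{\bar{\mu}}_P^b - D_{(\hat{m},\hat{\Sigma})}(N \circ T)[\hat{m}^b, \hat{\Sigma}^b]\right)\right\|_{H(k)}^2. \qquad (4.10)$$

The validity of this bootstrap method is justified in Section 4.2.4.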

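Finally we define an estimator $\hat{q}_{\alpha,n}$ of $q_{\alpha,n}$ from the generated $B$ bootstrap replications
$[n\hat{\Delta}^2]^1_{fast} \leq \ldots \leq [n\hat{\Delta}^2]^B_{fast}$ (assuming they are sorted)

$$\hat{q}_{\alpha,n} = [n\hat{\Delta}^2]^{\lfloor B(1-\alpha) \rfloor}_{fast},$$

where $\lfloor \cdot \rfloor$ stands for the integer part. The rejection region is defined by

$$\mathcal{R}_\alpha = \{n\hat{\Delta}^2 > \hat{q}_{\alpha,n}\}.$$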
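
The decision step can be sketched as follows, assuming the critical value is taken as the order statistic of rank $\lfloor B(1-\alpha) \rfloor$ among the sorted replications. The function names and the guard for very small $B$ are assumptions of this example.

```python
import math
import numpy as np

def bootstrap_critical_value(replications, alpha):
    """Order statistic of rank floor(B * (1 - alpha)) of the replications."""
    reps = np.sort(np.asarray(replications, dtype=float))
    B = reps.size
    j = max(math.floor(B * (1.0 - alpha)), 1)   # 1-based rank, at least 1
    return reps[j - 1]

def reject_null(statistic, replications, alpha=0.05):
    """True when the observed statistic falls in the rejection region."""
    return statistic > bootstrap_critical_value(replications, alpha)
```

For instance, with 20 replications equal to 1, 2, ..., 20 and a level of 0.25, the rank is 15 and the critical value is the 15th smallest replication.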
4.2.4 Validity of the fast parametric bootstrap

Proposition 4.4 hereafter shows the validity of the fast parametric bootstrap as presented in
Section 4.2.3. The proof of Proposition 4.4 is provided in Section B.2.

Proposition 4.4. Assume $\mathbb{E}_P k^{1/2}(Y, Y)$, $\mathrm{Tr}(\Sigma)$ and $\mathbb{E}_P \|Y - m_0\|^4$ are finite. Also assume
that $T$ is continuously differentiable on $\Theta_0$.

If $H_0$ is true, then for each $b = 1, \ldots, B$,

(i) $\sqrt{n}\left(\hat{\bar{\mu}}_P - N[\tilde{m}, \tilde{\Sigma}]\right) \xrightarrow[n \to +\infty]{\mathcal{L}} G_P - D_{(m_0,\Sigma_0)}(N \circ T)[U_P]$

(ii) $\sqrt{n}\left(\hat{\bar{\mu}}_P^b - D_{(\hat{m},\hat{\Sigma})}(N \circ T)[\hat{m}^b, \hat{\Sigma}^b]\right) \xrightarrow[n \to +\infty]{\mathcal{L}} G'_P - D_{(m_0,\Sigma_0)}(N \circ T)[U'_P]$

where $(G_P, U_P)$ and $(G'_P, U'_P)$ are i.i.d. random variables in $H(k) \times \Theta$.

If otherwise $H_0$ is false, (ii) is still true.

Furthermore, $G_P$ and $U_P$ are zero-mean Gaussian r.v. with covariances

$$\mathrm{Var}(G_P) = \mathbb{E}_{Y \sim P}\left(k(Y, \cdot) - \bar{\mu}_P\right)^{\otimes 2}$$
$$\mathrm{Var}(U_P) = \mathbb{E}_{Y \sim P}\left[Y - m_0, (Y - m_0)^{\otimes 2} - \Sigma_0\right]^{\otimes 2}$$
$$\mathrm{cov}(G_P, U_P) = \mathbb{E}_{Y \sim P}\left(k(Y, \cdot) - \bar{\mu}_P\right) \otimes \left[Y - m_0, (Y - m_0)^{\otimes 2} - \Sigma_0\right].$$

By the Continuous Mapping Theorem and the continuity of $\|\cdot\|_{H(k)}^2$, Proposition 4.4 guarantees
that the estimated quantile converges almost surely to the true one as $n, B \to +\infty$, so
that the type-I error equals $\alpha$ asymptotically.

Note that in [26] the parameter subspace $\Theta_0$ must be a subset of $\mathbb{R}^p$ for some integer $p \geq 1$.
Proposition 4.4 allows $\Theta_0$ to be a subset of a possibly infinite-dimensional Hilbert space ($m$
belongs to $\mathcal{H}$ and $\Sigma$ belongs to the space of finite trace operators $\mathcal{H} \to \mathcal{H}$).

Figure 1 (left plot) compares empirically the bootstrap distribution of $[n\hat{\Delta}^2]^b_{fast}$ and the
distribution of $n\hat{\Delta}^2$. When $n = 1000$, the two corresponding densities are superimposed and
a two-sample Kolmogorov-Smirnov test returns a p-value of 0.978, which is consistent with a strong
similarity between the two distributions. Therefore the fast bootstrap method seems to provide
a very good approximation of the distribution of $n\hat{\Delta}^2$ even for a moderate sample size $n$.

5 Test performances 

5.1 An upper bound for the Type-II error 

Let us assume the null-hypothesis is false, that is $P \neq \mathcal{N}(m_0, \Sigma_0)$ or $(m_0, \Sigma_0) \notin \Theta_0$. Theorem 5.1
gives the magnitude of the Type-II error, that is the probability of wrongly accepting $H_0$. The
proof can be found in Appendix B.3.

Before stating Theorem 5.1, let us introduce or recall useful notation : 
Figure 1: Left: Comparison of the distributions of $n\hat{\Delta}^2$ (test statistic) and $[n\hat{\Delta}^2]^b_{fast}$ (fast
bootstrap replication) when $n = 1000$. A Kolmogorov-Smirnov two-sample test applied to our
simulations returns a p-value of 0.978 which confirms the apparent similarity between the two
distributions. Right: Comparison of the execution time (in seconds) of both classical and
fast bootstrap methods.


• $\Delta = \|\bar{\mu}_P - (N \circ T)[m_0, \Sigma_0]\|_{H(k)}$,

• $q_{\alpha,n} = \mathbb{E}\, \hat{q}_{\alpha,n}$,

• $m_P^2 = \mathbb{E}_P \|D_{(m_0,\Sigma_0)}(N \circ T)[\Psi(Y)] - k(Y, \cdot) + \bar{\mu}_P\|_{H(k)}^2$,

where $\Psi(Y) = (Y - m_0, [Y - m_0]^{\otimes 2} - \Sigma_0)$ and $D_{(m_0,\Sigma_0)}(N \circ T)$ denotes the Fréchet derivative
of $N \circ T$ at $(m_0, \Sigma_0)$. According to Proposition 4.4 and the continuous mapping theorem,
$\hat{q}_{\alpha,n}$ corresponds to an order statistic of a random variable which converges weakly to
$\|G_P - D_{(m_0,\Sigma_0)}(N \circ T)[U_P]\|^2$ (as defined in Proposition 4.4). Therefore, its mean $q_{\alpha,n}$ tends to
a finite quantity as $n \to +\infty$. $\Delta$ and $m_P$ do not depend on $n$ either.

Theorem 5.1. (Type II error) Assume $\sup_{x,y \in \mathcal{H}_0} |k(x, y)| = M < +\infty$ where $Y \in \mathcal{H}_0 \subseteq \mathcal{H}$
$P$-almost surely and $\hat{q}_{\alpha,n}$ is independent of $n\hat{\Delta}^2$.
Then, for any $n > q_{\alpha,n} \Delta^{-2}$,


P (nA^ < qa,n ) < exp 






2mp + CmpM^/^{A^ — qa,n/n) 


f{a,B,M,A) 


where 


(5.11)


/(a, B, M, A) (1 + o„(l)) (^1 + + C"'AWm^ 

and C,C ,C are absolute constants and Cpb only depends on the distribution of 
The first implication of Theorem 5.1 is that our test is consistent, that is

$$\mathbb{P}(n\hat{\Delta}^2 \leq \hat{q}_{\alpha,n} \mid H_0 \text{ false}) \xrightarrow[n \to +\infty]{} 0.$$


Furthermore, the upper bound in (5.11) reflects the expected behaviour of the Type-II error
with respect to meaningful quantities. When $\Delta$ decreases, the bound increases (alternative
more difficult to detect). When $\alpha$ (Type-I error) decreases, $q_{\alpha,n}$ gets larger and $n$ has to be
larger for the bound to hold. The variance term $m_P^2$ encompasses the difficulty of estimating $\bar{\mu}_P$
and of estimating the parameters as well. In the special case when $m$ and $\Sigma$ are known,
$T$ is the constant map and the chain rule yields $D_{(m_0,\Sigma_0)}(N \circ T) = (D_{T(m_0,\Sigma_0)}N) \circ (D_{(m_0,\Sigma_0)}T) = 0$ so that
$m_P^2 = \mathbb{E}\|k(Y, \cdot) - \bar{\mu}_P\|_{H(k)}^2$ reduces to the variance of $\hat{\bar{\mu}}_P$ up to a factor $n$. As expected, a large $m_P$ makes the bound
larger. Note that the estimation of the critical value, which is related to the term $f(\alpha, B, M, \Delta)$
in (5.11), does not alter the asymptotic rate of convergence of the bound.

Note that assuming that $\hat{q}_{\alpha,n}$ is independent of $n\hat{\Delta}^2$ is reasonable for a large $n$, since $n\hat{\Delta}^2$
and $\hat{q}_{\alpha,n}$ are asymptotically independent according to Proposition 4.4.


5.2 Empirical study of type-I/II errors 

Empirical performances of our test are inferred on the basis of synthetic data. For the sake of 
brevity, our test is referred to as KNT (Kernel Normality Test) in the following. 

One main concern of goodness-of-fit tests is their drastic loss of power as dimensionality
increases. Empirical evidence (see Table 3 in [40]) shows that current multivariate normality tests
suffer from such deficiencies. The purpose of the present section is to check if KNT displays a good
behavior in high or infinite dimension.

5.2.1 Finite-dimensional case (Synthetic data) 

Reference tests. The power of our test is compared with that of two multivariate normality 
tests: the Henze-Zirkler test (HZ) [22] and the energy distance (ED) test [40]. The main idea 
of these tests is briefly recalled in Appendices A.1 and A.2.
Figure 2: Type-I and type-II errors of KNT (+ blue), Energy Distance (∘ black), and Henze-Zirkler
(△ red). Two alternative distributions are considered: HA1 (rows 1 and 3) and HA2
(rows 2 and 4). Two settings are considered: $d = 2$ (left) and $d = 100$ (right).


































Null and alternative distributions. Two alternatives are considered: a mixture of two
Gaussians with different means ($\mu_1 = 0$ and $\mu_2 = 1.5\,(1, 1/2, \ldots, 1/d)$) and same covariance
$\Sigma = 0.5\, \mathrm{diag}(1, 1/4, \ldots, 1/d^2)$, whose mixture proportions equal either $(0.5, 0.5)$ (alternative
HA1) or $(0.8, 0.2)$ (alternative HA2). Furthermore, two different cases for $d$ are considered:
$d = 2$ (small dimension) and $d = 100$ (large dimension).
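
A sample from these alternatives can be drawn as in the sketch below, which follows the stated means, covariance and mixture proportions. The function name `sample_mixture`, its signature and the use of NumPy's `default_rng` are assumptions of this example, not the authors' simulation code.

```python
import numpy as np

def sample_mixture(n, d, weights=(0.5, 0.5), rng=None):
    """Draw n points in R^d from the two-component Gaussian mixture.

    Component 1 has mean 0, component 2 has mean 1.5 * (1, 1/2, ..., 1/d);
    both share the covariance 0.5 * diag(1, 1/4, ..., 1/d^2).
    weights = (0.5, 0.5) gives HA1 and (0.8, 0.2) gives HA2.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 1.0 / np.arange(1, d + 1)        # (1, 1/2, ..., 1/d)
    mu2 = 1.5 * scale
    std = np.sqrt(0.5) * scale               # square root of the diagonal covariance
    labels = rng.random(n) < weights[1]      # True -> second component
    X = rng.standard_normal((n, d)) * std
    X[labels] += mu2
    return X
```

For example, `sample_mixture(500, 100, weights=(0.8, 0.2))` would give one HA2 sample in the large-dimension setting.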

Simulation design. 200 simulations are performed for each test, each alternative and each
$n$ (ranging from 100 to 500). $B$ is set at $B = 250$ for KNT. The test level is set at $\alpha = 0.05$ for
all tests.

Results. In the small dimension case (Figure 2, left column), the actual Type-I errors of all
tests remain within ±0.02 of $\alpha$. Their Type-II errors are superimposed and quickly
decrease down to 0 when $n \geq 200$. On the other hand, experimental results reveal different
behaviors as $d$ increases (Figure 2, right column). Whereas the ED test loses power, KNT and HZ
still exhibit small Type-II error values. Besides, the ED and KNT Type-I errors remain around
the prescribed level $\alpha$ while that of HZ is close to 1, which shows that its small Type-II error
is artificial. This seems to confirm that the HZ and ED tests are not suited to high-dimensional
settings, unlike KNT.

5.2.2 Infinite-dimensional case (real data) 

Dataset and chosen kernel. Let us consider the USPS dataset which consists of handwritten 
digits represented by a vectorized 8×8 greyscale matrix ($\mathcal{X} = \mathbb{R}^{64}$). A Gaussian kernel
^g(') •) = exp(—cr^ll • — • IP) is used with cr^ = 10“^. Gomparing sub-datasets ”Usps236” 
(keeping the three classes "2", "3" and "6", 541 observations) and "Usps358" (classes "3", "5"
and "8", 539 observations), the 3D-visualization (Figure 3, top panels) suggests three
well-separated Gaussian components for "Usps236" (left panel), and more overlapping classes for
"Usps358" (right panel).

Reference tests. KNT is compared with the Random Projection (RP) test, specially designed
for infinite-dimensional settings. RP is presented in Appendix A.3. Several numbers of
projections $p$ are considered for the RP test: $p = 1$, 5 and 15.

Simulation design. We set $\alpha = 0.05$ and 200 repetitions are performed for each sample
size.

Results. (Figure 3, bottom plots) RP is by far less powerful than KNT in both cases, no matter
how many random projections $p$ are considered. Indeed, KNT exhibits a Type-II error near
0 when $n$ is barely equal to 100, whereas RP still has a relatively large Type-II error when
$n = 400$. On the other hand, RP becomes more powerful as $p$ gets larger, as expected. A large
enough number of random projections may allow RP to catch up with KNT in terms of power. But
RP has a computational advantage over KNT only when $p = 1$, where the RP test statistic is
distribution-free. This is no longer the case when $p \geq 2$ and the critical value for the RP test
is only available through Monte-Carlo methods.
Figure 3: 3D-Visualization (Kernel PCA) of the "Usps236" (top row, left) and "Usps358"
(top row, right) datasets; comparison of Type-II error (bottom row, left: "Usps236", right:
"Usps358") for: KNT (× blue) and Random Projection with $p = 1$ (• green), $p = 5$ (△ purple)
and $p = 15$ (+ red) random projections.
6 Application to covariance rank selection

6.1 Covariance rank selection through sequential testing 

Under the Gaussian assumption, the null hypothesis becomes

$$H_0 : (m_0, \Sigma_0) \in \Theta_0,$$

and our test reduces to a test on parameters.

We focus on the estimation of the rank of the covariance operator $\Sigma$. Namely, we consider
a collection of models $(\mathcal{M}_r)_{1 \leq r \leq r_{max}}$ such that, for each $r = 1, \ldots, r_{max}$,

$$\mathcal{M}_r = \{P = \mathcal{N}(m, \Sigma_r) \mid m \in H(k) \text{ and } \mathrm{rk}(\Sigma_r) = r\}.$$

Each of these models corresponds to one of the following null hypotheses

$$H_{0,r} : \mathrm{rank}(\Sigma) = r, \quad r = 1, \ldots, r_{max},$$

and the corresponding tests can be used to select the most reliable model. These tests are
performed in a sequential procedure summarized in Algorithm 2. This sequential procedure
yields an estimator $\hat{r}$ defined as

$$\hat{r} = \min\{\tilde{r} : H_{0,r} \text{ rejected for } r = 1, \ldots, \tilde{r} - 1 \text{ and } H_{0,\tilde{r}} \text{ accepted}\},$$

or $\hat{r} = r_{max}$ if all of the hypotheses are rejected.

Sequential testing to estimate the rank of a covariance matrix (or more generally a noisy 
matrix) is mentioned in [30] and [31]. Both of these papers focus on the probability of selecting 
a wrong rank, that is $P(\hat{r} \neq r^*)$ where $r^*$ denotes the true rank. The goal is to choose a level 
of confidence $\alpha$ such that this probability of error converges almost surely to 0 when $n \to +\infty$. 

There are two ways of guessing a wrong rank: either by overestimation or by underestimation. 
Getting $\hat{r}$ greater than $r^*$ implies that the null hypothesis $H_{0,r^*}$ was tested and wrongly 
rejected, hence a probability of overestimating $r^*$ at most equal to $\alpha$. Underestimating means 
that at least one of the false null hypotheses $H_{0,1}, \dots, H_{0,r^*-1}$ was wrongly accepted (Type-II 
error). Let $\beta_r(\alpha)$ denote the Type-II error of testing $H_{0,r}$ with confidence level $\alpha$ for each 
$r < r^*$. Thus, by a union bound argument, 


$$P(\hat{r} \neq r^*) \le \sum_{r=1}^{r^*-1} \beta_r(\alpha) + \alpha . \qquad (6.12)$$

The bound in (6.12) decreases to 0 only if $\alpha$ converges to 0 but at a slow rate. Indeed, the 
Type-II errors $\beta_r(\alpha)$ grow with decreasing $\alpha$ but converge to zero when $n \to +\infty$. For instance 
in the case of the sequential tests mentioned in [30] and [31], the correct rate of decrease for 
$\alpha$ must satisfy $(1/n) \log(1/\alpha) = o_n(1)$. 
Algorithm 2 Sequential selection of covariance rank 

Input: Gram matrix $K = [k(Y_i, Y_j)]_{i,j}$, confidence level $0 < \alpha < 1$ 

1. Set $r = 1$. 

2. Test $H_{0,r}$. If $H_{0,r}$ is rejected and $r < r_{\max}$, set $r = r + 1$ and repeat step 2. 

3. Otherwise, set the estimator of the rank $\hat{r} = r$. 

Output: estimated rank $\hat{r}$ 
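
The control flow of Algorithm 2 can be sketched as follows. This is only an illustrative sketch: the callable `reject` is a hypothetical stand-in for the test of the rank-r null hypothesis at level alpha (it is not part of the original text), and the function name `select_rank` is ours.

```python
from typing import Callable


def select_rank(reject: Callable[[int], bool], r_max: int) -> int:
    """Sequential rank selection.

    `reject(r)` is a hypothetical callable returning True when the
    hypothesis "rank = r" is rejected at the chosen level.
    Returns the first r that is accepted, or r_max if all are rejected.
    """
    for r in range(1, r_max + 1):
        if not reject(r):
            return r  # first accepted hypothesis
    return r_max      # every hypothesis was rejected


# Toy usage: a fake test that rejects every rank below 3.
estimated = select_rank(lambda r: r < 3, r_max=10)
```

Because the procedure stops at the first accepted hypothesis, the number of tests actually run is at most the estimated rank.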


6.2 Empirical performances 

In this section, the sequential procedure to select covariance rank (as presented in Section 6.1) 
is tested empirically on synthetic data. 

Dataset A sample of n zero-mean Gaussian observations with covariance $\Sigma_{r^*}$ is generated, where n 
ranges from 100 to 5000. $\Sigma_{r^*}$ is of rank $r^* = 10$ and its eigenvalues decrease either polynomially 
($\lambda_r = r^{-1}$ for all $r \le r^*$) or exponentially ($\lambda_r = \exp(-0.2r)$ for all $r \le r^*$). 
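
A minimal sketch of how such a sample could be simulated is given below. Several choices are assumptions made for illustration only: the ambient dimension `dim` (not stated in this section), a covariance that is diagonal in the canonical basis, the polynomial exponent taken equal to -1, and the helper names `eigenvalues` and `sample_low_rank_gaussian`.

```python
import numpy as np


def eigenvalues(r_star: int, decay: str) -> np.ndarray:
    """Non-zero eigenvalues lambda_1..lambda_{r_star} for the two decay regimes."""
    r = np.arange(1, r_star + 1)
    if decay == "polynomial":
        return 1.0 / r            # assumed exponent -1
    if decay == "exponential":
        return np.exp(-0.2 * r)
    raise ValueError("decay must be 'polynomial' or 'exponential'")


def sample_low_rank_gaussian(n: int, dim: int, r_star: int, decay: str, seed: int = 0) -> np.ndarray:
    """Draw n zero-mean Gaussian vectors whose covariance has rank r_star."""
    rng = np.random.default_rng(seed)
    lam = np.zeros(dim)
    lam[:r_star] = eigenvalues(r_star, decay)
    # Scaling independent standard normals by sqrt(lambda) gives covariance diag(lam).
    return rng.standard_normal((n, dim)) * np.sqrt(lam)


X = sample_low_rank_gaussian(n=200, dim=50, r_star=10, decay="exponential")
```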

Benchmark To illustrate the level of difficulty, we compare our procedure with an oracle 
procedure which uses the knowledge of the true rank. Namely, the oracle procedure follows our 
sequential procedure at a level $\alpha_{\text{oracle}}$ defined as follows 

$$\alpha_{\text{oracle}} = \max_{1 \le r \le r^*-1} P_Z(n\hat{\Delta}_r^2 \le Z_r) ,$$

where $n\hat{\Delta}_r^2$ is the observed statistic for the r-th test and $Z_r$ follows the distribution of this 
statistic under $H_{0,r}$. Hence $\alpha_{\text{oracle}}$ is chosen such that the true rank $r^*$ is selected whenever it 
is possible. 

Simulation design To get a consistent estimation of $r^*$, the confidence level $\alpha$ must 
decrease with n and is set at $\alpha = \alpha_n = \exp(-0.125\, n^{0.45})$. Each time, 200 simulations are 
performed. 

Results The top panels of Figure 4 display the proportion of cases when the target rank 
is found, either for our sequential procedure or the oracle one. When the eigenvalues decay 
polynomially, the oracle shows that the target rank cannot be almost surely guessed until 
n = 1500. When n < 1500, our procedure finds the true rank with probability at least 0.8 and 
quickly catches up to the oracle as n grows. In the exponential decay case, a similar observation 
is made. This case seems to be easier, as our procedure performs almost as well as the oracle 
when n > 600. In all cases, the consistency of our procedure is confirmed by the simulations. 

The bottom panels of Figure 4 compare $\alpha$ with the probability of overestimating $r^*$ (denoted 
by $p_+$). As noticed in Section 6.1, the former is an upper bound of the latter. But we must 
check empirically whether the gap between those two quantities is not too large, otherwise the 
sequential procedure would be too conservative and lead to excessive underestimation of $r^*$. In 
the polynomial decay case, the difference between $\alpha$ and $p_+$ is small, even when n = 100. The 
gap is larger in the exponential case but gets broader when n < 1500. 
Figure 4: Top half: probabilities of finding the right rank with respect to n for our sequential 
test (• red) and the oracle procedure (△ blue); bottom half: probabilities of overestimating 
the true rank with the sequential procedure compared with fixed alpha (+ green). In each case, 
two decay rates for covariance eigenvalues are considered: polynomial (left column) and 
exponential (right column). 


6.3 Robustness analysis 

In practice, none of the models $\mathcal{M}_r$ is true. An additive full-rank noise term is often considered 
in the literature [10, 25]. Namely, we set in our case 

$$Y = Z + \epsilon \qquad (6.13)$$

where $Z \sim \mathcal{N}(m, \Sigma_{r^*})$ with $\operatorname{rk}(\Sigma_{r^*}) = r^*$ and $\epsilon$ is the error term independent of $Z$. Note that 
the Gaussian assumption concerns the main signal $Z$ and not the error term whereas usual 
models assume the converse [10, 25]. 

Figure 5 illustrates the performance of our sequential procedure under the noisy model 
(6.13). We set $\mathcal{H} = \mathbb{R}^{100}$, $n = 600$, $r^* = 3$ and $\Sigma_{r^*} = \Sigma_3 = \operatorname{diag}(\lambda_1, \dots, \lambda_3, 0, \dots, 0)$ where 
$\lambda_r = \exp(-0.2r)$ for $r \le 3$. The noise term is $\epsilon = (\lambda_3 \rho^{-1} \eta_j)_{1 \le j \le 100}$ where $\eta_1, \dots, \eta_{100}$ are i.i.d. 
Student random variables with 10 degrees of freedom and $\rho > 0$ is the signal-to-noise ratio. 
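
The noisy model (6.13) with the settings above could be simulated along the following lines. This is a sketch under stated assumptions: the signal mean is taken to be zero (the text does not specify it), the covariance is diagonal in the canonical basis of a 100-dimensional space, and the function name `sample_noisy_model` is hypothetical.

```python
import numpy as np


def sample_noisy_model(n: int = 600, dim: int = 100, r_star: int = 3,
                       rho: float = 1.0, df: int = 10, seed: int = 0) -> np.ndarray:
    """Draw Y = Z + eps with a rank-r_star Gaussian signal and Student noise."""
    rng = np.random.default_rng(seed)
    lam = np.zeros(dim)
    lam[:r_star] = np.exp(-0.2 * np.arange(1, r_star + 1))
    # Gaussian signal with covariance diag(lam); zero mean is an assumption.
    Z = rng.standard_normal((n, dim)) * np.sqrt(lam)
    # Full-rank noise: smallest non-zero eigenvalue, divided by the ratio rho,
    # times i.i.d. Student variables with df degrees of freedom.
    eta = rng.standard_t(df, size=(n, dim))
    eps = lam[r_star - 1] / rho * eta
    return Z + eps


Y_noisy = sample_noisy_model(rho=2.0)
```

Larger values of `rho` shrink the noise, so the sample gets closer to an exactly rank-3 Gaussian sample.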
Figure 5: Illustration of the robustness of our sequential procedure under a noisy model. 


As expected, the probability of guessing the target rank $r^*$ decreases down to 0 as the 
signal-to-noise ratio $\rho$ diminishes. However, choosing a smaller level of confidence $\alpha$ 
improves the probability of right guesses for a fixed $\rho$, without sacrificing much for smaller 
signal-to-noise ratios. This is due to the fact that each null hypothesis is false, hence the 
need for a smaller $\alpha$ (smaller Type-I error) which yields greater Type-II errors and avoids the 
rejection of all of the null hypotheses. 

7 Conclusion 

We introduced a new normality test suited to high-dimensional Hilbert spaces. It turns out to 
be more powerful than existing high- or infinite-dimensional tests (such as random projection). 
In particular, empirical studies showed a mild sensitivity to high dimensionality. Therefore our 
test can be used as a multivariate normality (MVN) test without suffering a loss of power when 
d gets larger, unlike other MVN tests (Henze-Zirkler, energy distance). 

If the Gaussian assumption is validated beforehand, our test becomes a general test on 
parameters. It is illustrated with an application to covariance rank selection that plugs our 
test into a sequential procedure. Empirical evidence shows the good performance and the 
robustness of this method. 

As for future improvements, investigating the influence of the kernel k on the performance 
of the test would be of interest. In the case of the Gaussian kernel for instance, a method to 
optimize the Type-II error with respect to the hyperparameter $\sigma$ would be welcome. This 
aspect has just begun to be studied in [21] when performing homogeneity testing with a convex 
combination of kernels. 

Finally, the choice of the level $\alpha$ for the sequential procedure (covariance rank selection) is 
another subject for future research. Indeed, an asymptotic regime for $\alpha$ has been exhibited to 
get consistency, but setting the value of $\alpha$ when n is fixed remains an open question. 
A Goodness-of-fit tests 

A.1 Henze-Zirkler test 


The Henze-Zirkler test [22] relies on the following statistic 

$$HZ = \int_{\mathbb{R}^d} \left| \hat{\Phi}(t) - \Phi(t) \right|^2 \omega(t)\, dt , \qquad (A.14)$$

where $\Phi(t)$ denotes the characteristic function of $\mathcal{N}(0, I)$, $\hat{\Phi}(t) = n^{-1} \sum_{j=1}^n e^{i \langle t, Y_j \rangle}$ is the 
empirical characteristic function of the sample $Y_1, \dots, Y_n$, and $\omega(t) = (2\pi\beta)^{-d/2} \exp(-\|t\|^2/(2\beta))$ 
with $\beta = 2^{-1/2} [(2d + 1)n/4]^{1/(d+4)}$. The $H_0$-hypothesis is rejected for large values of HZ. Note 
that the sample $Y_1, \dots, Y_n$ must be whitened (centered and renormalized) beforehand. 


A.2 Energy distance test 

The energy distance test [40] is based on 

$$\mathcal{E}(P, P_0) = 2 E\|Y - Z\| - E\|Y - Y'\| - E\|Z - Z'\| \qquad (A.15)$$

which is called the energy distance, where $Y, Y' \sim P$ and $Z, Z' \sim P_0$. Note that $\mathcal{E}(P, P_0) = 0$ if 
and only if $P = P_0$. The test statistic is given by 

$$\hat{\mathcal{E}} = \frac{2}{n} \sum_{i=1}^n E_Z \|Y_i - Z\| - E_{Z,Z'} \|Z - Z'\| - \frac{1}{n^2} \sum_{i,j=1}^n \|Y_i - Y_j\| , \qquad (A.16)$$

where $Z, Z' \sim \mathcal{N}(0, I)$ (null-distribution). HZ and ED tests set the $H_0$-distribution at 
$P_0 = \mathcal{N}(\hat{\mu}, \hat{\Sigma})$ where $\hat{\mu}$ and $\hat{\Sigma}$ are respectively the standard empirical mean and covariance. As 
for the Henze-Zirkler test, data must be centered and renormalized. 

A.3 Projection-based statistical tests 

In the high-dimensional setting, several approaches share a common idea consisting in projecting 
on one-dimensional spaces. This idea relies on the Cramér-Wold theorem extended to 
infinite-dimensional Hilbert spaces. 

Proposition A.1. (Prop. 2.1 from [24]) Let $\mathcal{H}$ be a separable Hilbert space with inner product 
$\langle \cdot, \cdot \rangle$, and let $Y, Z \in \mathcal{H}$ denote two random variables with respective Borel probability measures $P_Y$ 
and $P_Z$. If for every $h \in \mathcal{H}$, $\langle Y, h \rangle = \langle Z, h \rangle$ weakly, then $P_Y = P_Z$. 

[24] suggest to randomly choose directions h from a Gaussian measure and perform a 
Kolmogorov-Smirnov test on $\langle Y_1, h \rangle, \dots, \langle Y_n, h \rangle$ for each h, leading to the test statistic 

$$D_n(h) = \sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F_0(x)| \qquad (A.17)$$

where $\hat{F}_n(x) = (1/n) \sum_{i=1}^n 1_{\{\langle Y_i, h \rangle \le x\}}$ is the empirical cumulative distribution function (cdf) of 
$\langle Y_1, h \rangle, \dots, \langle Y_n, h \rangle$ and $F_0(x) = P(\langle Z, h \rangle \le x)$ denotes the cdf of $\langle Z, h \rangle$. 

Since [24] proved too few directions lead to a less powerful test, this can be repeated for 
several randomly chosen directions h, keeping then the largest value for $D_n(h)$. However the test 
statistic is no longer distribution-free (unlike the univariate Kolmogorov-Smirnov one) when 
the number of directions is larger than 2. 
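
As an illustration of the random projection idea in finite dimension, the sketch below computes the largest Kolmogorov-Smirnov statistic over a few random directions. It is not the implementation of [24]: directions are drawn from a standard Gaussian, the null distribution is a Gaussian with known mean `m0` and covariance `Sigma0`, and the function names are hypothetical. It only returns the statistic; calibrating a rejection threshold (for instance by simulation under the null) is left out.

```python
import math
import numpy as np


def ks_statistic_normal(x, mean: float, std: float) -> float:
    """sup_x |F_n(x) - F_0(x)| where F_0 is the N(mean, std^2) cdf."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cdf = np.array([0.5 * (1.0 + math.erf((v - mean) / (std * math.sqrt(2.0)))) for v in x])
    upper = np.arange(1, n + 1) / n - cdf   # gap just after each jump of F_n
    lower = cdf - np.arange(0, n) / n       # gap just before each jump of F_n
    return float(max(upper.max(), lower.max()))


def random_projection_statistic(Y, m0, Sigma0, n_dirs: int, seed: int = 0) -> float:
    """Largest KS statistic over n_dirs random one-dimensional projections."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_dirs):
        h = rng.standard_normal(Y.shape[1])
        # Under the null, <Z, h> is N(<m0, h>, <Sigma0 h, h>).
        mean = float(m0 @ h)
        std = math.sqrt(float(h @ Sigma0 @ h))
        stats.append(ks_statistic_normal(Y @ h, mean, std))
    return max(stats)
```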

B Proofs 

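Throughout this section, $\langle \cdot, \cdot \rangle$ (resp. $\|\cdot\|$) denotes either $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ or $\langle \cdot, \cdot \rangle_{H(k)}$ (resp. $\|\cdot\|_{\mathcal{H}}$ or 
$\|\cdot\|_{H(k)}$) depending on the context. 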

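B.1 Proof of Propositions 4.2 and 4.3 

Consider the eigenexpansion $\Sigma = \sum_{i \ge 1} \lambda_i \Psi_i^{\otimes 2}$, where $\lambda_1, \lambda_2, \dots$ is a decreasing sequence of 
positive reals and where $(\Psi_i)_{i \ge 1}$ form a complete orthonormal basis of $\mathcal{H}$. 

Let $Z \sim \mathcal{N}(m, \Sigma)$. The centered orthogonal projections $\langle Z - m, \Psi_i \rangle$ are $\mathcal{N}(0, \lambda_i)$ and, for $i \neq j$, 
$\langle Z, \Psi_i \rangle$ and $\langle Z, \Psi_j \rangle$ are independent. Let $Z'$ be an independent copy of $Z$. 

• Gaussian kernel case: $k(\cdot, \cdot) = \exp(-\sigma \|\cdot - \cdot\|^2)$ 

Let us first expand the following quantity 

$$\begin{aligned}
N[m, \Sigma](y) &= E_Z \exp\left(-\sigma \|Z - y\|^2\right) \\
&= E_Z \exp\Big(-\sigma \sum_{i \ge 1} \langle Z - y, \Psi_i \rangle^2\Big) \\
&= \prod_{i=1}^{+\infty} E_Z \exp\left(-\sigma \langle Z - y, \Psi_i \rangle^2\right) \qquad (B.18) \\
&= \prod_{i=1}^{+\infty} (1 + 2\sigma\lambda_i)^{-1/2} \exp\left(-\sigma \frac{\langle m - y, \Psi_i \rangle^2}{1 + 2\sigma\lambda_i}\right) \\
&= |I + 2\sigma\Sigma|^{-1/2} \exp\left(-\sigma \langle (I + 2\sigma\Sigma)^{-1}(y - m), y - m \rangle\right) .
\end{aligned}$$

We can switch the mean and the limit in (B.18) by using the Dominated Convergence theorem 
since for every $N \ge 1$ 

$$\Big| \prod_{i=1}^{N} \exp\left(-\sigma \langle Z - y, \Psi_i \rangle^2\right) \Big| \le 1 < +\infty .$$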
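The second quantity $\|N[m, \Sigma]\|^2$ is computed likewise 

$$\begin{aligned}
\|N[m, \Sigma]\|^2 &= E_{Z,Z'} \exp\left(-\sigma \|Z - Z'\|^2\right) \\
&= E_{Z,Z'} \exp\Big(-\sigma \sum_{i \ge 1} \langle Z - Z', \Psi_i \rangle^2\Big) \\
&= \prod_{i=1}^{+\infty} E_{Z,Z'} \exp\left(-\sigma \langle Z - Z', \Psi_i \rangle^2\right) \\
&= \prod_{i=1}^{+\infty} E_Z E_{Z'}\left( \exp\left(-\sigma \langle Z - Z', \Psi_i \rangle^2\right) \mid Z \right) \\
&= \prod_{i=1}^{+\infty} E_Z\, (1 + 2\sigma\lambda_i)^{-1/2} \exp\left(-\sigma \frac{\langle m - Z, \Psi_i \rangle^2}{1 + 2\sigma\lambda_i}\right) \\
&= \prod_{i=1}^{+\infty} (1 + 2\sigma\lambda_i)^{-1/2}\, E_{U \sim \mathcal{N}(0, \lambda_i)} \exp\left(-\frac{\sigma U^2}{1 + 2\sigma\lambda_i}\right) \\
&= \prod_{i=1}^{+\infty} (1 + 2\sigma\lambda_i)^{-1/2} \left(1 + \frac{2\sigma\lambda_i}{1 + 2\sigma\lambda_i}\right)^{-1/2} \\
&= \prod_{i=1}^{+\infty} (1 + 4\sigma\lambda_i)^{-1/2} \\
&= |I + 4\sigma\Sigma|^{-1/2} .
\end{aligned}$$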

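• Exponential kernel case: $k(\cdot, \cdot) = \exp(\langle \cdot, \cdot \rangle_{\mathcal{H}})$ 

Let us first expand the following quantity 

$$\begin{aligned}
N[m, \Sigma](y) &= E_Z \exp(\langle Z, y \rangle) \\
&= \exp\left(\langle m, y \rangle + (1/2) \langle \Sigma y, y \rangle\right) .
\end{aligned}$$

Expanding $\|N[m, \Sigma]\|^2$, 

$$\begin{aligned}
\|N[m, \Sigma]\|^2 &= E_{Z,Z'} \exp(\langle Z, Z' \rangle) \\
&= E_{Z,Z'} \exp\Big(\sum_{i \ge 1} \langle Z, \Psi_i \rangle \langle Z', \Psi_i \rangle\Big) \\
&= \prod_{i=1}^{\infty} E_{Z,Z'} \exp\left(\langle Z, \Psi_i \rangle \langle Z', \Psi_i \rangle\right) . \qquad (B.19)
\end{aligned}$$

We can switch the mean and the limit by using the Dominated Convergence theorem since 
the partial products converge to $\exp(\langle Z, Z' \rangle)$ and satisfy, for every $N \ge 1$, 

$$\Big| \prod_{i=1}^{N} \exp\left(\langle Z, \Psi_i \rangle \langle Z', \Psi_i \rangle\right) \Big| \le \exp\left(\frac{\|Z\|^2 + \|Z'\|^2}{2}\right)$$

and 

$$E_{Z,Z'} \exp\left(\frac{\|Z\|^2 + \|Z'\|^2}{2}\right) = \left(E_Z\, k^{1/2}(Z, Z)\right)^2 < +\infty .$$

The integrability of $k^{1/2}(Z, Z)$ is necessary to ensure the existence of the embedding $N[m, \Sigma]$ 
related to the distribution of $Z$. As we will see thereafter, it is guaranteed by the condition 
$\lambda_1 < 1$. 

For each $i$, 

$$\begin{aligned}
E_{Z,Z'} \exp\left(\langle Z, \Psi_i \rangle \langle Z', \Psi_i \rangle\right) &= E_Z E_{Z'}\left( \exp\left(\langle Z, \Psi_i \rangle \langle Z', \Psi_i \rangle\right) \mid Z \right) \\
&= E_Z \exp\left(\langle m, \Psi_i \rangle \langle Z, \Psi_i \rangle + (\lambda_i/2) \langle Z, \Psi_i \rangle^2\right) \\
&= (1 - \lambda_i^2)^{-1/2} \exp\left(\frac{\langle m, \Psi_i \rangle^2}{1 - \lambda_i}\right) . \qquad (B.20)
\end{aligned}$$

Plugging (B.20) into Equation (B.19), 

$$\|N[m, \Sigma]\|^2 = |I - \Sigma^2|^{-1/2} \exp\left(\|(I - \Sigma)^{-1/2} m\|^2\right) . \qquad (B.21)$$

(B.21) is well defined only if $|I - \Sigma^2| > 0$. As we have assumed that $\lambda_i < 1$ for all $i$, it is 
nonnegative. The non-nullity also holds since 

$$\prod_{i \ge 1} (1 - \lambda_i^2) = \exp\Big(-\sum_{i \ge 1} \log\frac{1}{1 - \lambda_i^2}\Big) \ge \exp\Big(-\sum_{i \ge 1} \frac{\lambda_i^2}{1 - \lambda_i^2}\Big) , \qquad (B.22)$$

where we used the inequality $\log(x) \le x - 1$. Since $\Sigma$ is a finite trace operator, its eigenvalues 
converge towards 0 and $\lambda_i^2 \le \lambda_i < 1$ for $i$ large enough. Thus, $\operatorname{Tr}(\Sigma^2)$ is finite and it follows 
from (B.22) that $|I - \Sigma^2| > 0$. 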
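
The exponential kernel case can be checked numerically in the same spirit. The sketch below assumes a diagonal covariance with all eigenvalues below 1 and compares the closed form $|I - \Sigma^2|^{-1/2} \exp(\|(I - \Sigma)^{-1/2} m\|^2)$ with a tensor-product Gauss-Hermite evaluation of $E \exp(\langle Z, Z' \rangle)$. The factor involving $m$ is obtained here by direct one-dimensional Gaussian integration and reduces to $|I - \Sigma^2|^{-1/2}$ when $m = 0$; function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def squared_norm_exponential_closed_form(m, lam) -> float:
    """|I - Sigma^2|^{-1/2} exp(||(I - Sigma)^{-1/2} m||^2) for Sigma = diag(lam)."""
    m = np.asarray(m, dtype=float)
    lam = np.asarray(lam, dtype=float)
    if np.any(lam >= 1.0):
        raise ValueError("every eigenvalue must be strictly below 1")
    det = np.prod(1.0 - lam ** 2)
    return float(det ** -0.5 * np.exp(np.sum(m ** 2 / (1.0 - lam))))


def squared_norm_exponential_quadrature(m, lam, deg: int = 60) -> float:
    """E exp(<Z, Z'>) for independent Z, Z' ~ N(m, diag(lam)), coordinate by coordinate."""
    nodes, weights = hermegauss(deg)
    weights = weights / np.sqrt(2.0 * np.pi)   # normalise to a probability measure
    total = 1.0
    for mi, li in zip(m, lam):
        z = mi + np.sqrt(li) * nodes
        # Tensor-product rule over the pair (Z_i, Z'_i).
        total *= float(weights @ np.exp(np.outer(z, z)) @ weights)
    return total
```

With small eigenvalues such as `lam = [0.3, 0.1]` and `m = [0.2, -0.1]`, both evaluations should agree closely; as an eigenvalue approaches 1 the closed form blows up, which reflects the condition $\lambda_1 < 1$.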
B.2 Proof of Theorem 4.4 

The proof of Proposition 4.4 follows the same idea as the original paper [26] and broadens its 
field of applicability. Namely, the parameter space does not need to be a subset of $\mathbb{R}^d$ anymore. 
The main ingredient for our version of the proof is to use the CLT in Banach spaces [23] instead 
of the multiplier CLT for empirical processes ([27], Theorem 10.1). 

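We introduce the following notation: 

• $\theta_0 = (m_0, \Sigma_0)$ denotes the true parameters 

• $\theta_n = (\hat{m}, \hat{\Sigma})$ denotes the empirical parameters 

• For any $\theta = (m, \Sigma)$ and $y$, $\Psi_\theta(y) = (y - m, (y - m)^{\otimes 2} - \Sigma)$. 

• $\theta^b_{0,n} = (1/n) \sum_{i=1}^n (Z_i^b - \bar{Z}^b) \Psi_{\theta_0}(Y_i)$, $\theta^b_n = (1/n) \sum_{i=1}^n (Z_i^b - \bar{Z}^b) \Psi_{\theta_n}(Y_i)$ 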

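Define the covariance operators $C_1 : H(k) \to H(k)$, $C_2 : \Theta \to \Theta$ and $C_{1,2} : \Theta \to H(k)$ as 

$$\begin{aligned}
C_1 &= \operatorname{Var}\left(\sqrt{n}(\hat{\mu}_P - \mu_P)\right) = E_Y (k(Y, \cdot) - \mu_P)^{\otimes 2} \\
C_2 &= \operatorname{Var}\Big(n^{-1/2} \sum_{i=1}^n \Psi_{\theta_0}(Y_i)\Big) = E_Y \Psi_{\theta_0}(Y)^{\otimes 2} \\
C_{1,2} &= \operatorname{cov}\Big(\sqrt{n}(\hat{\mu}_P - \mu_P),\ n^{-1/2} \sum_{i=1}^n \Psi_{\theta_0}(Y_i)\Big) = E_Y (k(Y, \cdot) - \mu_P) \otimes \Psi_{\theta_0}(Y) .
\end{aligned}$$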

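From [23], the CLT holds in a Hilbert space under the assumption of finite second moment 
(satisfied in our case since $\operatorname{Tr}(C_1) = E_P k(Y, Y) - \|\mu_P\|^2 < +\infty$ and $\operatorname{Tr}(C_2) = E_P \|Y - m_0\|^4 + 
\operatorname{Tr}(\Sigma_0 - \Sigma_0^2) < +\infty$ by assumption). Therefore, 

$$\sqrt{n}(\hat{\mu}_P - \mu_P) \xrightarrow[n \to +\infty]{\mathcal{L}} G_P \sim \mathcal{GP}(0, C_1)$$

$$n^{-1/2} \sum_{i=1}^n \Psi_{\theta_0}(Y_i) \xrightarrow[n \to +\infty]{\mathcal{L}} U_P \sim \mathcal{GP}(0, C_2) .$$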


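Introducing 

$$\begin{aligned}
C_1^b &:= \operatorname{Var}\left(\sqrt{n}\,\hat{\mu}_P^b\right) = n^{-1} \sum_{i=1}^n E(Z_i - \bar{Z})^2 C_1 = (1 - 1/n) C_1 \\
C_2^b &:= \operatorname{Var}\left(\sqrt{n}\,\theta^b_{0,n}\right) = n^{-1} \sum_{i=1}^n E(Z_i - \bar{Z})^2 C_2 = (1 - 1/n) C_2 \\
C_{1,2}^b &:= \operatorname{cov}\left(\sqrt{n}\,\theta^b_{0,n},\ \sqrt{n}\,\hat{\mu}_P^b\right) = n^{-1} \sum_{i=1}^n E(Z_i - \bar{Z})^2 C_{1,2} = (1 - 1/n) C_{1,2} ,
\end{aligned}$$

we derive likewise 

$$\sqrt{n}\,\hat{\mu}_P^b \xrightarrow[n \to +\infty]{\mathcal{L}} G'_P \sim \mathcal{GP}(0, C_1)$$

$$\sqrt{n}\,\theta^b_{0,n} \xrightarrow[n \to +\infty]{\mathcal{L}} U'_P \sim \mathcal{GP}(0, C_2) .$$

Since 

$$\operatorname{cov}\left(\sqrt{n}\,\hat{\mu}_P^b,\ \sqrt{n}(\hat{\mu}_P - \mu_P)\right) = 0 , \quad \operatorname{cov}\Big(\sqrt{n}\,\theta^b_{0,n},\ n^{-1/2} \sum_{i=1}^n \Psi_{\theta_0}(Y_i)\Big) = 0 ,$$

$$\operatorname{cov}\Big(\sqrt{n}\,\hat{\mu}_P^b,\ n^{-1/2} \sum_{i=1}^n \Psi_{\theta_0}(Y_i)\Big) = 0 , \quad \operatorname{cov}\left(\sqrt{n}\,\theta^b_{0,n},\ \sqrt{n}(\hat{\mu}_P - \mu_P)\right) = 0 ,$$

the limit Gaussian processes $(G_P, U_P)$ and $(G'_P, U'_P)$ are independent. 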

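Since $D_{\theta_0}(N \circ T)$ is assumed continuous w.r.t. $\theta_0$, we get by the continuous mapping theorem 
that 

$$\Big( \sqrt{n}(\hat{\mu}_P - \mu_P) - D_{\theta_0}(N \circ T)\Big[n^{-1/2} \sum_{i=1}^n \Psi_{\theta_0}(Y_i)\Big],\ \sqrt{n}\,\hat{\mu}_P^b - D_{\theta_0}(N \circ T)\big[\sqrt{n}\,\theta^b_{0,n}\big] \Big)$$

converges weakly to 

$$\left( G_P - D_{\theta_0}(N \circ T)[U_P],\ G'_P - D_{\theta_0}(N \circ T)[U'_P] \right) .$$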

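To get the final conclusion of Proposition 4.4, we have to prove two remaining things. 

First, under the assumption that $\mu_P = N \circ T(\theta_0)$, 

$$\begin{aligned}
\sqrt{n}(\hat{\mu}_P - N \circ T[\theta_n]) &= \sqrt{n}(\hat{\mu}_P - \mu_P) + \sqrt{n}(N \circ T[\theta_0] - N \circ T[\theta_n]) \\
&= \sqrt{n}(\hat{\mu}_P - \mu_P) - D_{\theta_0}(N \circ T)[\sqrt{n}(\theta_n - \theta_0)] + o_P(\sqrt{n}\|\theta_n - \theta_0\|) \\
&= \sqrt{n}(\hat{\mu}_P - \mu_P) - D_{\theta_0}(N \circ T)\Big[n^{-1/2} \sum_{i=1}^n \Psi_{\theta_0}(Y_i)\Big] \\
&\quad + D_{\theta_0}(N \circ T)[o_P(1)] + o_P(\sqrt{n}\|\theta_n - \theta_0\|) \\
&\xrightarrow[n \to +\infty]{\mathcal{L}} G_P - D_{\theta_0}(N \circ T)[U_P] ,
\end{aligned}$$

because $\sqrt{n}(\theta_n - \theta_0)$ converges weakly to a zero-mean Gaussian and by using the continuous 
mapping theorem with the continuity of $\theta_0 \mapsto D_{\theta_0}(N \circ T)$ and of $\|\cdot\|_\Theta$. 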

Secondly, whether $H_0$ is true or not, 

- DsJNoPlV^ei] - D,„{NoT)lVfKn] + Di,,(NoT)lMel„ - <)] 

-v-^ 

: = {a) 

+ Dg, (NoT) [V^elJ - Dg„ (NoT) . 

-V-^ 

:=(6) 
we must check that both (a) and (b) converge P-almost surely to 0. 

Since by the continuous mapping theorem 

(a)= ]p,„(iVor)[(mo-m,So-S„)] 0 , 

\ ^ ^ ^ ^ n^+oo 

vj_^ ^-0 a.s. 

—>Ar(0,l) 

and since 

( 6 ) = Dqo (NoT) [y/nO^ J - Dg^ (NoT) [y/nO^ J —> 0 P-almost surely , 
as $\theta_n(\omega) \to \theta_0$ P-almost surely and N is continuous w.r.t. $\theta$, it follows 

- Dg^{Nor)[V^9i] -^G'p- De,{NoT)[U'p] , 
hence the conclusion of Proposition 4.4. 

B.3 Proof of Theorem 5.1 

The goal is to get an upper bound for the Type-II error 

$$P(n\hat{\Delta}^2 \le \hat{q} \mid H_1) . \qquad (B.23)$$

In the following, the feature map from $H(k)$ to $H(\bar{k})$ will be denoted as 

$$\bar{\phi} : H(k) \to H(\bar{k}) , \quad y \mapsto \bar{k}(y, \cdot) .$$

Besides, we use the shortened notation $\hat{q} := \hat{q}_{\alpha,n,B}$ for the sake of simplicity (see Section 5.1 
for definitions). 

1. Reduce $n\hat{\Delta}^2$ to a sum of independent terms 

The first step consists in getting a tight upper bound for (B.23) which involves a sum of 
independent terms. This will allow the use of a Bennett concentration inequality in the 
next step. 

Introducing the Fréchet derivative $D_{\theta_0}(N \circ T)$ of $N \circ T$ at $\theta_0$, $n\hat{\Delta}^2$ is expanded as follows 

1 ” - 

=- ^ (0(F,) - Nori9n), 0(1^) - Nori9n)) 

:=nAQ-|-riA^-|-2?7,S'„-|-e . (B.24) 

where 

1 ” 

<=«)•=«)) • 

* 5=1 
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1 

Sn = - - Nor{do),^{Yi)) , 

2=1 


S(F,) = D,„(iVor)(^(F,))-0(F.)+/2p , 


and e = o„(l) almost surely. 

$n\hat{\Delta}_0^2$ is a degenerate U-statistic so it converges weakly to a sum of weighted chi-squares 
([35], page 194). $\sqrt{n} S_n$ converges weakly to a zero-mean Gaussian by the classic CLT as 
long as $E\langle \mu_P - N \circ T(\theta_0), \Xi(Y_i) \rangle^2$ is finite (which is true since k is bounded). It follows 
that $\hat{\Delta}_0^2$ becomes negligible with respect to $S_n$ when n is large. Therefore, we consider a 
surrogate for the Type-II error (B.23) by removing $\hat{\Delta}_0^2$ with a negligible loss of accuracy. 

Plugging (B.24) into (B.23) 


P(nA^ < q) =P(nAQ -|- nL‘^ -|- 2nS'„ < q — e) 

= (1-|-o„(l)) P(nAQ-|--|-2nS'„ < g) . 

Finally, using $\hat{\Delta}_0^2 \ge 0$ and conditioning on $\hat{q}$ yields the upper bound 

P(nA^ < g I g) < (1 + On(l)) P ( ^ /(f^i) > ns | , 


, 2=1 


where 


/(F,):= {jip-NoT{e,),E{Yi)) , s := _ i . 

n 


(B.25) 


(B.26) 


2. Apply a concentration inequality 

We now want to find an upper bound for (B.26) through a concentration inequality, 
namely Lemma B.1 with $\xi_i = f(Y_i)$, $\epsilon = ns$, $\nu^2 = \operatorname{Var}(f(Y_i))$ and $f(Y_i) \le c = \tilde{M}$ 
(P-almost surely). 

Lemma B.1 combined with Lemmas B.4 and B.3 yields the upper bound 

P(y^ /(F)) > ns I g) < exp (-=— ] 1 ^>q -|- 1 ^^q 

:=exp(g(s))l 5 >o + l 5 <o :=/z(s) , (B.27) 

where 

M:= (4 + A 2 ) LM ^/2 ^ p 2 < ^2 _ ^2^1 ^ 
and nrp = E ||S(F)||^. 
3. "Replace" the estimator $\hat{q}_{\alpha,n}$ with the true quantile $q_{\alpha,n}$ in the bound 

It remains to take the expectation with respect to $\hat{q}$. In order to make it easy, $\hat{q}$ is pulled out 
of the exponential term of the bound. This is done through a Taylor-Lagrange expansion 
(Lemma B.5). 

Lemma B.5 rewrites the bound in (B.27) as 

+ (2/3)Mfe ) (W) ~ ^ 1 } ’ 

where 

s = , s = L'^ , q e {qAq,qy q) , 

n n 

and $s > 0$ because of the assumption $n > q L^{-2}$. 

The mean (with respect to $\hat{q}$) of the right-side multiplicative term of (B.28) is bounded 
by 



because of the Cauchy-Schwarz inequality. 

On one hand, E(g — —)• 0 (Lemma B.2) implies E(g — g)^ —)■ 0 and thus 

B —^-|-oo B —^-|-oo 

? —t goo weakly where q^c = lim„_,.+oo (la,n (that is almost surely for q^o is a constant). 

B^+00 ’ 

Hence 


Ea 



V M-d / 



E^ 


^ ^ OB{\q-q\) 



^ ^ Eg(oj3(|g-g|)lg>o) 


where ob denotes almost sure convergence. 

Since the variable |g — g|ls>o is bounded by the constant |nL^ — <?| V |g| for every R, it 
follows from the Dominated Convergence Theorem 


^ E^(oB(|g-g|)lg>o) _ ^ ou(l) 

On the other hand. Lemma B.2 provides 

>2 Eiq — qY Cip + aC2p/B Cpb 


(B.29) 


(B.30) 
so that an upper bound for the Type-II error is given by 



9 

ns 



^ _l_ ^Cpb ^ ob{B 


1/2) 


(B.31) 


2792 + (2/3)M?9s 


which one rewrites as 


exp 





2m?p + CmpM^/‘^{L‘^ — q/n) 


where 


/i(B,M,L) = (3/2 + o„(l)) 1 + 


Cpb ob{B 

C'L‘^M^/^mpV^ ^ C'L^Mml 



P / 


and C = (2/3)(4 + ^2), C' = 2(4 + y2)/3 and C" = (4 + ^/2f. 


Theorem 5.1 is proved. 

B.4 Auxiliary results 

Lemma B.1. (Bennett's inequality, Theorem 2.9 in [4]) Let $\xi_1, \dots, \xi_n$ be i.i.d. zero-mean 
variables bounded by $c$ and of variance $\nu^2$. 

Then, for any $\epsilon > 0$ 



(B.32) 


Lemma B.2. (Theorem 2.9 in [5]) Assume $\alpha < 1/2$. Then, 



(B.33) 


where $C_{P,b}$ only depends on the bootstrap distribution (of $[n\hat{\Delta}^2]^b_{fast}$). 

Lemma B.3. If Y G TLq C Ti P-almost surely and ~ 


y ^B-o 


\f{y)\ <M -.= {4 + V2 )lVm . 


(B.34) 
Proof. 


\fiy)\ 


= \{fLp- Nor{eo),De,{Nor){^{y)) - ^{y) + fip) 

<L\\De,{NoT){^{y))-^iy)+fip\\ 


< L 


limE 




1/2 


+ ^k{y,y) + ^Jm{YX)^ 

< L (yM + + TzM + Vm + Vm) 

< L{A + V2)Vm :=M . 


Lemma B.4. 

Z/2 < ^2 XXp . 

Proof. 

z/2 ;= Var(/(F)) = Ef{Yi) 

=E(/2p - iVor(0o), Dg^{Nor)XiY)) - 0(F) + /xp)^ 

< XE\\De,{Nor)X{Y)) - 4>{Y) + fipf . 

---^ 


□ 


(B.35) 


□ 


Lemma B.5. Let h he defined as in (B.27). Then, 


h{s) < exp 


ns 


2x92 + (2/3) Mi9s 


3n /3|g —g| 


w/T M 1 ^xp ls>o|s - s 


2Md 


2Md 


(B.36) 


where 


s = Lf — - , s = Lf — - , ge(gAg, gVg) . 


n 


n 


Proof. A Taylor-Lagrange expansion of order 1 can be derived for h{s) since the derivative of 
h 


. i2/?>)nMdx^ + And'^x 

h (x) = - ^^^——=—^ exp 

^ ^ {2d^ + {2/3)Mdxf 


nx 


U 


2d^ + {2/3)Mdx 


*-a::>0 ? 
is well defined for every $x \in \mathbb{R}$ (in particular, the left-side and right-side derivatives at x = 0 
coincide). 

Therefore h{s) eqnals 
h{s) -|- h (s)(s — s) 

[l + exp(f7(s)-^7(5))f7'(s)l,->o(s-s)] , (B.37) 

where 

s = L'^ - - , 8 = 1“^--, ge(gAg,gVg), 

n n 

and $s > 0$ because of the assumption $n > q L^{-2}$. 


For every x,|/ > 0, \g'{x)\ < 3n/(2M'd) and then \g{x)—g{y)\ < 3?7,|x —j/|/(2M'd). It follows 




3n 

2® ’ 


and 


exp {g{s) 


g{s))< exp 


3n 

2Md 



Lemma B.5 is proved. 


exp 


V 2Md ) 


□ 